CS 188 Fall 2013 Introduction to Artificial Intelligence Midterm 1


CS 188 Fall 2013 Introduction to Artificial Intelligence Midterm 1

• You have approximately 2 hours and 50 minutes.
• The exam is closed book, closed notes except your one-page crib sheet.
• Please use non-programmable calculators only.
• Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. All short answer sections can be successfully answered in a few sentences AT MOST.

First name, Last name, SID, edX username, First and last name of student to your left, First and last name of student to your right

For staff use only:
Q1. Search /9
Q2. InvisiPac /9
Q3. CSPs /27
Q4. Utilities /10
Q5. Games: Three-Player Cookie Pruning /9
Q6. The nature of discounting /10
Q7. The Value of Games /10
Q8. Infinite Time to Study /16
Total /100


Q1. [9 pts] Search

[Figure: state space graph on nodes A through G with edge costs, and a table of heuristic values. Surviving entries of the table: $h_1(B) = 9$, $h_2(B) = 12$; $h_1(C) = 8$, $h_2(C) = 10$; $h_1(D) = 7$, $h_2(D) = 8$; $h_1(G) = h_2(G) = 0$.]

Consider the state space graph shown above. A is the start state and G is the goal state. The costs for each edge are shown on the graph. Each edge can be traversed in both directions. Note that the heuristic $h_1$ is consistent but the heuristic $h_2$ is not consistent.

(a) [4 pts] Possible paths returned

For each of the following graph search strategies (do not answer for tree search), mark which, if any, of the listed paths it could return. Note that for some search strategies the specific path returned might depend on tie-breaking behavior. In any such cases, make sure to mark all paths that could be returned under some tie-breaking scheme.

| Search Algorithm | A-B-D-G | A-C-D-G | A-B-C-D-F-G |
| Depth first search | x | x | x |
| Breadth first search | x | x | |
| Uniform cost search | | | x |
| A* search with heuristic $h_1$ | | | x |
| A* search with heuristic $h_2$ | | | x |

The returned path depends on tie-breaking, so every path that some tie-breaking scheme can produce has to be marked. DFS can return any of the paths. BFS returns a shallowest path, so it can return either A-B-D-G or A-C-D-G. A-B-C-D-F-G is the optimal path for this problem, so UCS and A* with the consistent heuristic $h_1$ return that path. Although $h_2$ is not consistent, A* with $h_2$ also returns this path.

(b) Heuristic function properties

Suppose you are completing the new heuristic function $h_3$ shown below. All the values are fixed except $h_3(B)$.

[Table: values of $h_3$ for nodes A through G. Only $h_3(A) = 10$ and the unknown $h_3(B) = ?$ survive in this transcription.]

For each of the following conditions, write the set of values that are possible for $h_3(B)$. For example, to denote all non-negative numbers, write $[0, \infty]$, to denote the empty set, write $\emptyset$, and so on.

(i) [1 pt] What values of $h_3(B)$ make $h_3$ admissible?

To make $h_3$ admissible, $h_3(B)$ has to be less than or equal to the actual optimal cost from B to the goal G, which is the cost of path B-C-D-F-G, i.e. 12. The answer is $0 \le h_3(B) \le 12$.

(ii) [2 pts] What values of $h_3(B)$ make $h_3$ consistent?
All the other nodes except node B satisfy the consistency conditions. The consistency conditions that do involve the state B are:

$h(A) \le c(A, B) + h(B)$
$h(C) \le c(C, B) + h(B)$
$h(D) \le c(D, B) + h(B)$
$h(B) \le c(B, A) + h(A)$
$h(B) \le c(B, C) + h(C)$
$h(B) \le c(B, D) + h(D)$

Filling in the numbers shows this results in the condition: $9 \le h_3(B) \le 10$

(iii) [2 pts] What values of $h_3(B)$ will cause A* graph search to expand node A, then node C, then node B, then node D in order?

Let $h_3(B) = x$. After expanding A, the fringe holds B with $f = 1 + x$ and C with $f = 4 + 9 = 13$. After expanding C, it also holds B reached through C (call it B') with $f = 5 + x$ and D with $f = 7 + 7 = 14$. To expand C before B and B before D we need $1 + x > 13$, and $5 + x < 14$ (expand B') or $1 + x < 14$ (expand B), so we get $12 < h_3(B) < 13$.
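The consistency bound in part (ii) can be checked mechanically. The sketch below is only an illustration: the edge costs around B (A-B = 1, B-C = 1, B-D = 5) and the values $h_3(C) = 9$, $h_3(D) = 7$ are assumptions read off the garbled figure and the $f$-values quoted in the solution, and the function name is hypothetical.

```python
# Assumed edge costs incident to B and assumed fixed h3 values of B's neighbours.
edges_from_b = {"A": 1, "C": 1, "D": 5}
h3 = {"A": 10, "C": 9, "D": 7}

def consistent_range(edges, h):
    """Return (lower, upper) bounds on h(B) implied by consistency on B's edges."""
    # h(n) <= c(n, B) + h(B)  rearranges to  h(B) >= h(n) - c(n, B)
    lower = max(h[n] - cost for n, cost in edges.items())
    # h(B) <= c(B, n) + h(n)
    upper = min(cost + h[n] for n, cost in edges.items())
    return lower, upper

lo, hi = consistent_range(edges_from_b, h3)
print(lo, hi)
```

Under these assumptions the binding constraints are the edge to A (lower bound) and the edge to C (upper bound).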

Q2. [9 pts] InvisiPac

Pacman finds himself to have an invisible friend, InvisiPac. Whenever InvisiPac visits a square with a food pellet, InvisiPac will eat that food pellet, giving away its location at that time. Suppose the maze's size is $M \times N$ and there are $F$ food pellets at the beginning. Pacman and InvisiPac alternate moves. Pacman can move to any adjacent square that is not a wall (including the one where InvisiPac is), just as in the regular game. After Pacman moves, InvisiPac can teleport into any of the four squares that are adjacent to Pacman, as marked with the dashed circle in the figure. InvisiPac can occupy wall squares.

(a) For this subquestion, whenever InvisiPac moves, it chooses randomly from the squares adjacent to Pacman. The dots eaten by InvisiPac don't count towards Pacman's score. Pacman's task is to eat as many food pellets as possible.

(i) [1 pt] Which of the following is best suited to model this problem from Pacman's perspective? Options: state space search, CSP, minimax game, MDP, RL.

InvisiPac moves to each adjacent square randomly with probability 1/4. From Pacman's point of view, it is an MDP with the transition function reflecting this uncertainty.

(ii) [2 pts] What is the size of a minimal state space for this problem? Give your answer as a product of factors that reference problem quantities such as M, N, F, etc. as appropriate. Below each factor, state the information it encodes. For example, you might write $4 \times MN$ and write "number of directions" underneath the first term and "Pacman's position" under the second.

$2^F \times MN$ (boolean vector for whether each food has been eaten, Pacman's position)

(b) For this subquestion, whenever InvisiPac moves, it always moves into the same square relative to Pacman. For example, if InvisiPac starts one square North of Pacman, InvisiPac will always move into the square North of Pacman. Pacman knows that InvisiPac is stuck this way, but doesn't know which of the four relative locations he is stuck in.
As before, if InvisiPac ends up being in a square with a food pellet, it will eat it and Pacman will thereby find out InvisiPac's location. Pacman's task is to find a strategy that minimizes the worst-case number of moves it could take before Pacman knows InvisiPac's location.

(i) [1 pt] Which of the following is best suited to model this problem from Pacman's perspective? Options: state space search, CSP, minimax game, MDP, RL.

InvisiPac is stuck in one of the four squares relative to Pacman. This is a search problem whose state space includes a boolean vector recording which of the four relative locations InvisiPac might still be in. The goal is to reach a state in which only one possible location remains.

(ii) [2 pts] What is the size of a minimal state space for this problem? Give your answer as a product of factors that reference problem quantities such as M, N, F, etc. as appropriate. Below each factor, state the information it encodes. For example, you might write $4 \times MN$ and write "number of directions" underneath the first term and "Pacman's position" under the second.

$2^F \times MN \times 2^4$ (boolean vector for whether each food has been eaten, Pacman's position, boolean vector for which of the four locations InvisiPac might be in)

(c) For this subquestion, whenever InvisiPac moves, it can choose freely between any of the four squares adjacent to Pacman. InvisiPac tries to eat as many food pellets as possible. Pacman's task is to eat as many food pellets as possible.

(i) [1 pt] Which of the following is best suited to model this problem from Pacman's perspective? Options: state space search, CSP, minimax game, MDP, RL.

InvisiPac tries to eat as many food pellets as possible and thus plays adversarially. It is a minimax game problem.

(ii) [2 pts] What is the size of a minimal state space for this problem? Give your answer as a product of factors that reference problem quantities such as M, N, F, etc. as appropriate. Below each factor, state the information it encodes. For example, you might write $4 \times MN$ and write "number of directions" underneath the first term and "Pacman's position" under the second.

$2^F \times MN$ (boolean vector for whether each food has been eaten, Pacman's position)
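To make the three state-space sizes concrete, here is a small sketch that evaluates them for a given maze. The function name and the sample values of M, N and F are hypothetical and chosen only for illustration.

```python
def state_space_sizes(M, N, F):
    """Sizes of the minimal state spaces for parts (a), (b) and (c)."""
    positions = M * N      # Pacman's position
    food = 2 ** F          # one bit per pellet: eaten or not
    belief = 2 ** 4        # part (b): one bit per relative location InvisiPac might be in
    return {
        "a_random": food * positions,
        "b_stuck": food * positions * belief,
        "c_adversarial": food * positions,
    }

sizes = state_space_sizes(M=3, N=4, F=2)
print(sizes)
```

Only part (b) needs the extra factor, because only there does Pacman carry information about InvisiPac from one move to the next.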

Q3. [27 pts] CSPs

(a) Pacman's new house

After years of struggling through mazes, Pacman has finally made peace with the ghosts, Blinky, Pinky, Inky, and Clyde, and invited them to live with him and Ms. Pacman. The move has forced Pacman to change the rooming assignments in his house, which has 6 rooms. He has decided to figure out the new assignments with a CSP in which the variables are Pacman (P), Ms. Pacman (M), Blinky (B), Pinky (K), Inky (I), and Clyde (C), the values are which room they will stay in, from 1-6, and the constraints are:

i) No two agents can stay in the same room
ii) P > 3
iii) K is less than P
iv) M is either 5 or 6
v) P > M
vi) B is even
vii) I is not 1 or 6
viii) |I - C| = 1
ix) |P - B| = 2

(i) [1 pt] Unary constraints. On the grid below cross out the values from each domain that are eliminated by enforcing unary constraints.

[Grid: one row of values 1-6 for each of P, B, C, K, I, M.]

The unary constraints are ii, iv, vi, and vii. ii crosses out 1, 2, and 3 for P. iv crosses out 1, 2, 3, and 4 for M. vi crosses out 1, 3, and 5 for B. vii crosses out 1 and 6 for I. K and C have no unary constraints, so their domains remain the same.

(ii) [1 pt] MRV. According to the Minimum Remaining Value (MRV) heuristic, which variable should be assigned to first? Options: P, B, C, K, I, M.

M has the fewest values remaining in its domain (2), so it should be selected first for assignment.

(iii) [2 pts] Forward Checking. For the purposes of decoupling this problem from your solution to the previous problem, assume we choose to assign P first, and assign it the value 6. What are the resulting domains after enforcing unary constraints (from part i) and running forward checking for this assignment?

[Grid: P fixed to 6; one row of values 1-6 for each of B, C, K, I, M.]

In addition to enforcing the unary constraints from part i, the domains are further constrained by all constraints involving P. This includes constraints i, iii, v, and ix. i removes 6 from the domains of all variables. iii removes 6 from the domain of K (already removed by constraint i).
v removes 6 from the domain of M (also already removed by i). ix removes 2 and 6 from the domain of B.

(iv) [3 pts] Iterative Improvement. Instead of running backtracking search, you decide to start over and run iterative improvement with the min-conflicts heuristic for value selection. Starting with the following assignment:

P:6, B:4, C:3, K:2, I:1, M:5

First, for each variable write down how many constraints it violates in the table below. Then, in the table on the right, for all variables that could be selected for assignment, put an x in any box that corresponds to a possible value that could be assigned to that variable according to min-conflicts. When marking next values a variable could take on, only mark values different from the current one.

| Variable | # violated |
| P | 0 |
| B | 0 |
| C | 1 |
| K | 0 |
| I | 2 |
| M | 0 |

Next values marked with an x: C can take value 2; I can take value 2 or 4. No other variable is marked.

Both I and C violate constraint viii, because |I - C| = 2. I also violates constraint vii. No other variables violate any constraints. According to iterative improvement, any conflicted variable could be selected for assignment, in this case I and C. According to min-conflicts, the values that those variables can take on are the values that minimize the number of constraints violated by the variable. Assigning 2 or 4 to I causes it to violate only constraint i, because other variables already have the values 2 and 4. Assigning 2 to C also causes C to violate only 1 constraint.
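The unary filtering, the forward-checking step for P = 6, and the min-conflicts counts above can be reproduced with a short script. This is a sketch under two assumptions: constraints viii and ix are read as absolute differences (|I - C| = 1, |P - B| = 2), and the all-different constraint i counts as a single violated constraint per variable, as the solution text does. All function names are hypothetical.

```python
ROOMS = range(1, 7)
VARS = ["P", "B", "C", "K", "I", "M"]

# Unary constraints ii, iv, vi, vii.
UNARY = {
    "P": lambda v: v > 3,
    "M": lambda v: v in (5, 6),
    "B": lambda v: v % 2 == 0,
    "I": lambda v: v not in (1, 6),
}

# Binary constraints iii, v, viii, ix as (x, y, predicate on (x value, y value)).
BINARY = [
    ("K", "P", lambda k, p: k < p),
    ("P", "M", lambda p, m: p > m),
    ("I", "C", lambda i, c: abs(i - c) == 1),   # assumed absolute value
    ("P", "B", lambda p, b: abs(p - b) == 2),   # assumed absolute value
]

def unary_domains():
    return {x: [v for v in ROOMS if UNARY.get(x, lambda _: True)(v)] for x in VARS}

def pair_ok(x, vx, y, vy):
    """True if x=vx, y=vy satisfies all-different and every binary constraint on the pair."""
    if vx == vy:
        return False
    for a, b, pred in BINARY:
        if (a, b) == (x, y) and not pred(vx, vy):
            return False
        if (a, b) == (y, x) and not pred(vy, vx):
            return False
    return True

def forward_check(domains, var, value):
    """Prune every other variable's domain against the single assignment var=value."""
    new = {var: [value]}
    for y in VARS:
        if y != var:
            new[y] = [v for v in domains[y] if pair_ok(var, value, y, v)]
    return new

def violations(assign, x, vx):
    """Number of constraints x would violate with value vx, others fixed by assign."""
    count = 0
    if x in UNARY and not UNARY[x](vx):
        count += 1
    if any(assign[y] == vx for y in VARS if y != x):
        count += 1                      # constraint i counted once
    for a, b, pred in BINARY:
        if a == x and not pred(vx, assign[b]):
            count += 1
        elif b == x and not pred(assign[a], vx):
            count += 1
    return count

def min_conflict_values(assign, x):
    scores = {v: violations(assign, x, v) for v in ROOMS}
    best = min(scores.values())
    return [v for v in ROOMS if scores[v] == best and v != assign[x]]

after_p6 = forward_check(unary_domains(), "P", 6)
start = {"P": 6, "B": 4, "C": 3, "K": 2, "I": 1, "M": 5}
counts = {x: violations(start, x, start[x]) for x in VARS}
print(after_p6)
print(counts, min_conflict_values(start, "I"), min_conflict_values(start, "C"))
```

Forward checking here only looks at constraints that involve the newly assigned variable; it does not propagate further, which is why C and K keep five values each.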

(b) Variable ordering

We say that a variable X is backtracked if, after a value has been assigned to X, the recursion returns at X without a solution, and a different value must be assigned to X. For this problem, consider the following three algorithms:

1. Run backtracking search with no filtering
2. Initially enforce arc consistency, then run backtracking search with no filtering
3. Initially enforce arc consistency, then run backtracking search while enforcing arc consistency after each assignment

(i) [5 pts] For each algorithm, circle all orderings of variable assignments that guarantee that no backtracking will be necessary when finding a solution to the CSP represented by the following constraint graph.

Candidate orderings (the same six are listed under each of Algorithm 1, Algorithm 2, and Algorithm 3):

A-B-C-D-E-F
F-E-D-C-B-A
C-A-B-D-E-F
B-D-A-F-E-C
D-E-F-C-B-A
B-C-D-A-E-F

Algorithm 1: No filtering means that there are no guarantees that an assignment to one variable has consistent assignments in any other variable, so backtracking may be necessary.

Algorithm 2: This algorithm is very similar to the tree-structured CSP algorithm presented in class, in which arcs are enforced from right to left, and then variables are assigned from left to right. The arcs enforced in that algorithm are a subset of all arcs enforced when enforcing arc consistency. Thus, any linear ordering of variables in which each variable is assigned before all of its children in the tree will guarantee no backtracking.

Algorithm 3: Any first assignment can be the root of a tree, which, from class, we know is consistent and will not require backtracking. This assignment can then be viewed as conditioning the graph on that variable, and after re-running arc consistency, it can be removed from the graph.
This results in either one or two tree-structured graphs that are also arc consistent, and the process can be repeated.

(ii) [5 pts] For each algorithm, circle all orderings of variable assignments that guarantee that no more than two variables will be backtracked when finding a solution to the CSP represented by the following constraint graph.

Candidate orderings (the same six are listed under each of Algorithm 1, Algorithm 2, and Algorithm 3):

C-F-A-B-E-D-G-H
F-C-A-H-E-B-D-G
A-B-C-E-D-F-G-H
G-C-H-F-B-D-E-A
A-B-E-D-G-H-C-F
A-D-B-G-E-H-C-F

Algorithm 1: This might backtrack for the same reason as algorithm 1 for the previous problem.

Algorithm 2: If the first two assignments are not a cutset (C-F, C-G, or F-B), the graph will still contain cycles, for which there is no guarantee that backtracking will not be necessary. If the first two assignments are a cutset, the remaining graph will be a tree. However, because arc consistency was not enforced after the assignment, there is no guarantee against further backtracking. To see this, consider the sub-graph A,B,C,E, with domains {1,2,3}, and constraints A=C, B A, E=C+2, B C, E=B. If this is assigned in the order C-A-B-E, then by assigning 1 to C and A, assigning either 1 or 2 to B would result in an empty domain for E and cause B to backtrack.

Algorithm 3: After assigning the cutset, the remaining graph is a tree, which guarantees no further backtracking with algorithm 3 as seen in the previous problem.

(c) All Satisfying Assignments

Now consider a modified CSP in which we wish to find every possible satisfying assignment, rather than just one such assignment as in normal CSPs. In order to solve this new problem, consider a new algorithm which is the same as the normal backtracking search algorithm, except that when it sees a solution, instead of returning it, the solution gets added to a list, and the algorithm backtracks. Once there are no variables remaining to backtrack on, the algorithm returns the list of solutions it has found.

For each graph below, select whether or not using the MRV and/or LCV heuristics could affect the number of nodes expanded in the search tree in this new situation.

All parts share the same reasoning. Since every value has to be checked regardless of the outcome of previous assignments, the order in which the values are checked does not matter, so LCV has no effect. In the general case, in which there are constraints between variables, the size of each domain can vary based on the order in which variables are assigned, so MRV can still have an effect on the number of nodes expanded for the new "find all solutions" task. The one time that MRV is guaranteed to not have any effect is when the constraint graph is completely disconnected, as is the case for part i. In this case, the domains of each variable do not depend on any other variable's assignment. Thus, the ordering of variables does not matter, and MRV cannot have any effect on the number of nodes expanded.

(i) [2 pts] Neither MRV nor LCV can have an effect. Only MRV can have an effect. Only LCV can have an effect. Both MRV and LCV can have an effect.

(ii) [2 pts] Neither MRV nor LCV can have an effect. Only MRV can have an effect. Only LCV can have an effect. Both MRV and LCV can have an effect.

(iii) [2 pts] Neither MRV nor LCV can have an effect. Only MRV can have an effect. Only LCV can have an effect. Both MRV and LCV can have an effect.
(iv) [2 pts] Neither MRV nor LCV can have an effect. Only MRV can have an effect. Only LCV can have an effect. Both MRV and LCV can have an effect.

(v) [2 pts] Neither MRV nor LCV can have an effect. Only MRV can have an effect. Only LCV can have an effect. Both MRV and LCV can have an effect.
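The argument for part (c) can be illustrated on a toy problem. The sketch below enumerates all solutions with plain backtracking and counts one node per consistent partial assignment. The three-variable CSP (X < Y, X = Z, with Z restricted to the single value 1) is a hypothetical example, not one of the exam's graphs.

```python
def count_nodes(order, domains, consistent, reverse_values=False):
    """Backtracking that enumerates every solution; returns (nodes, number of solutions)."""
    nodes = 0
    solutions = []

    def recurse(assign, depth):
        nonlocal nodes
        if depth == len(order):
            solutions.append(dict(assign))
            return
        var = order[depth]
        for value in sorted(domains[var], reverse=reverse_values):
            assign[var] = value
            if consistent(assign):
                nodes += 1              # one node per consistent partial assignment
                recurse(assign, depth + 1)
            del assign[var]

    recurse({}, 0)
    return nodes, len(solutions)

# Hypothetical toy CSP: X < Y and X == Z, where Z has only one value.
domains = {"X": [1, 2, 3], "Y": [1, 2, 3], "Z": [1]}

def consistent(a):
    if "X" in a and "Y" in a and not a["X"] < a["Y"]:
        return False
    if "X" in a and "Z" in a and a["X"] != a["Z"]:
        return False
    return True

xyz = count_nodes(["X", "Y", "Z"], domains, consistent)
zxy = count_nodes(["Z", "X", "Y"], domains, consistent)   # smallest domain first, as MRV would pick
xyz_rev = count_nodes(["X", "Y", "Z"], domains, consistent, reverse_values=True)
print(xyz, zxy, xyz_rev)
```

Changing the variable order changes the node count, while reversing the value order does not, because every value is eventually tried either way.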

Q4. [10 pts] Utilities

Pacman is buying a raffle ticket from the ghost raffle ticket vendor. There are two ticket types: A and B, but there are multiple specific tickets of each type. Pacman picks a ticket type, but the ghost will then choose which specific ticket Pacman will receive. Pacman's utility for a given raffle ticket is equal to the utility of the lottery of outcomes for that raffle ticket. For example, ticket $R_{A,2}$ corresponds to a lottery with equal chances of yielding 10 and 0, and so $U(R_{A,2}) = U([\frac{1}{2}, 10; \frac{1}{2}, 0])$. Pacman, being a rational agent, wants to maximize his expected utility, but the ghost may have other goals! The outcomes are illustrated below.

(a) Imagine that Pacman's utility for money is $U(m) = m$.

(i) [2 pts] What are the utilities to Pacman of each raffle ticket?

$U(R_{A,1}) = 6$, $U(R_{A,2}) = 5$, $U(R_{B,1}) = 6$, $U(R_{B,2}) = 7$

$U(R_{B,1}) = \frac{1}{3} U(3) + \frac{1}{3} U(6) + \frac{1}{3} U(9) = \frac{1}{3}(3 + 6 + 9) = 6$

(ii) [1 pt] Which raffle ticket will Pacman receive under optimal play if the ghost is trying to minimize Pacman's utility (and Pacman knows the ghost is doing so)? (circle one) $R_{A,1}$, $R_{A,2}$, $R_{B,1}$, $R_{B,2}$

(iii) [1 pt] What is the equivalent monetary value of raffle ticket $R_{B,1}$?

$U(m) = m = U(R_{B,1}) = 6$, so $m = 6$.

(b) Now imagine that Pacman's utility for money is given by $U(m) = m^2$.

(i) [2 pts] What are the utilities to Pacman of each raffle ticket?

$U(R_{A,1}) = 36$, $U(R_{A,2}) = 50$, $U(R_{B,1}) = 42$, $U(R_{B,2}) = 49$

$U(R_{B,1}) = \frac{1}{3} U(3) + \frac{1}{3} U(6) + \frac{1}{3} U(9) = \frac{1}{3}(9 + 36 + 81) = 42$

(ii) [1 pt] The ghost is still trying to minimize Pacman's utility, but the ghost mistakenly thinks that Pacman's utility is given by $U(m) = m$, and Pacman is aware of this flaw in the ghost's model. Which raffle ticket will Pacman receive? (circle one) $R_{A,1}$, $R_{A,2}$, $R_{B,1}$, $R_{B,2}$

(iii) [1 pt] What is the equivalent monetary value of raffle ticket $R_{B,1}$? (you may leave your answer as an expression)

$U(m) = m^2 = U(R_{B,1}) = 42$, so $m = \sqrt{42}$.

(c) [2 pts] Pacman has the raffle with distribution $[0.5, \$100; 0.5, \$0]$.
A ghost insurance dealer offers Pacman an insurance policy where Pacman will get \$100 regardless of what the outcome of the ticket is. If Pacman's utility for money is $U(m) = \sqrt{m}$, what is the maximum amount of money Pacman would pay for this insurance?

Call $c$ the cost of the insurance.
Pacman's utility with insurance: $U(\$100 - c) = \sqrt{100 - c}$
Pacman's utility without insurance: $U([0.5, \$100; 0.5, \$0]) = 0.5\sqrt{100} + 0.5\sqrt{0} = 5$
The maximum cost $c$ Pacman would be willing to pay for the insurance is the cost $c$ such that the two utilities are equal: $\sqrt{100 - c} = 5$, so $c = 75$.
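The ticket utilities and the insurance bound can be recomputed with a few lines. This is a sketch under assumptions: the lotteries for $R_{A,1}$ (a sure 6) and $R_{B,2}$ (a sure 7) are inferred from the stated utilities, since the figure is missing from this transcription, and the square-root utility in part (c) is inferred from the stated answer $c = 75$. All names are hypothetical.

```python
import math
from fractions import Fraction as Fr

def expected_utility(lottery, u):
    """lottery is a list of (probability, payoff) pairs; u maps payoff to utility."""
    return sum(p * u(x) for p, x in lottery)

# Assumed lotteries; A2 and B1 follow the text, A1 and B2 are inferred.
tickets = {
    "A1": [(Fr(1), 6)],
    "A2": [(Fr(1, 2), 10), (Fr(1, 2), 0)],
    "B1": [(Fr(1, 3), 3), (Fr(1, 3), 6), (Fr(1, 3), 9)],
    "B2": [(Fr(1), 7)],
}

linear = {k: expected_utility(l, lambda m: m) for k, l in tickets.items()}
squared = {k: expected_utility(l, lambda m: m ** 2) for k, l in tickets.items()}

# Part (c): sqrt(100 - c) = EU(no insurance), so c = 100 - EU**2.
eu_no_insurance = 0.5 * math.sqrt(100) + 0.5 * math.sqrt(0)
max_premium = 100 - eu_no_insurance ** 2

print(linear, squared, max_premium)
```

The certainty equivalent of the uninsured raffle under the square-root utility is 25, which is why Pacman would give up as much as 75 of the guaranteed 100.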

Q5. [9 pts] Games: Three-Player Cookie Pruning

Three of your TAs, Alvin, Sergey, and James, rent a cookie shuffler, which takes in a set number of cookies and groups them into 3 batches, one for each player. The cookie shuffler has three levers (with positions either UP or DOWN), which act to control how the cookies are distributed among the three players. Assume that 30 cookies are initially put into the shuffler. Each player controls one lever, and they act in turn. Alvin goes first, followed by Sergey, and finally James. Assume that all players are able to calculate the payoffs for every player at the terminal nodes. Assume the payoffs at the leaves correspond to the number of cookies for each player in their corresponding turn order. Hence, a utility of (7,10,13) corresponds to Alvin getting 7 cookies, Sergey getting 10 cookies, and James getting 13 cookies. No cookies are lost in the process, so the sum of cookies of all three players must equal the number of cookies put into the shuffler. Players want to maximize their own number of cookies.

(a) [3 pts] What is the utility triple propagated up to the root?

(15,12,3)

(b) [6 pts] Is pruning possible in this game? Fill in Yes or No. If yes, cross out all nodes (both leaves and intermediate nodes) that get pruned. If no, explain in one sentence why pruning is not possible. Assume the tree traversal goes from left to right.

Yes / No, Reasoning:

Left lowest subtree: James chooses the node (15,12,3) at node J1. This gets propagated up to S1. Sergey doesn't know what value he can get on his right child, so we explore that. Upon propagating (8,6,16) to J2, we must explore J2's right child, since there could be a triple that is better than James's best option (greater than 16) and better than Sergey's best option (greater than 12). (15,12,3) gets propagated up to A1. Alvin might have a good option (greater than 15) in the right subtree, so we explore down. On the (3,20,7) node, we propagate this up to J3.
We need to continue going down this path because Sergey doesn't know if he can get more than 20, and Alvin doesn't know if he can get more than 15 (A1's value). Hence, (12,12,6) is explored. (3,20,7) is propagated to S2. Now we can guarantee that Sergey will only accept an outcome at S2 that gives him at least 20 cookies. But, because the sum of cookies must be 30, this means that Alvin can get no more than 10 cookies in the right subtree, which is worse than the 15 he already has. Hence, we can immediately prune the remaining children of S2.

Q6. [10 pts] The nature of discounting

Pacman is stuck in a friendlier maze where he gets a reward every time he visits state (0,0). This setup is a bit different from the one you've seen before: Pacman can get the reward multiple times; these rewards do not get used up like food pellets and there are no living rewards. As usual, Pacman cannot move through walls and may take any of the following actions: go North, South, East, West, or stay in place. State (0,0) gives a total reward of 1 every time Pacman takes an action in that state regardless of the outcome, and all other states give no reward.

The first sentence in the paragraph above was confusing at exam time. The precise reward function is: $R_{(0,0),a} = 1$ for any action $a$ and $R_{s,a} = 0$ for all $s \ne (0,0)$.

You should not need to use any other complicated algorithm/calculations to answer the questions below. We remind you that geometric series converge as follows: $1 + \gamma + \gamma^2 + \cdots = 1/(1 - \gamma)$.

(a) [2 pts] Assume finite horizon of h = 10 (so Pacman takes exactly 10 steps) and no discounting ($\gamma = 1$). Fill in an optimal policy and fill in the value function (available actions: North, South, East, West, stay).

(b) The following Q-values correspond to the value function you specified above.

$Q^{\text{10 steps to go}}_{s,a} = R_{s,a} + V^{\text{9 steps to go}}_{s'}$, where $s'$ is the successor of state $s$ after taking action $a$.

(i) [1 pt] The Q value of state-action (0, 0), (East) is: 9

(ii) [1 pt] The Q value of state-action (1, 1), (East) is: 4

(c) Assume finite horizon of h = 10, no discounting, but the action to stay in place is temporarily (for this sub-point only) unavailable. Actions that would make Pacman hit a wall are not available. Specifically, Pacman cannot use actions North or West to remain in state (0, 0) once he is there.

(i) [1 pt] [true or false] There is just one optimal action at state (0, 0).

False: East and South are both optimal actions.

(ii) [1 pt] The value of state (0, 0) is: 5

Since the stay action is no longer available, Pacman has to leave state (0, 0) after every reward and come back, so he collects the reward only on every other time step.

(d) [2 pts] Assume infinite horizon, discount factor $\gamma = 0.9$. The value of state (0, 0) is: $1/(1 - \gamma) = 10$

(e) [2 pts] Assume infinite horizon and no discount ($\gamma = 1$). At every time step, after Pacman takes an action and collects his reward, a power outage could suddenly end the game with probability $\alpha$.

The value of state (0, 0) is: $1/\alpha = 10$
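Parts (c), (d) and (e) all reduce to summing a simple reward stream, which the following sketch evaluates numerically. The value $\alpha = 0.1$ is an assumption implied by the stated answer $1/\alpha = 10$, and the function names are hypothetical.

```python
def discounted_return(gamma, steps):
    """Sum of gamma**t for a reward of 1 at every step t = 0 .. steps-1."""
    return sum(gamma ** t for t in range(steps))

def return_with_outage(alpha, steps):
    """Expected undiscounted return when the game survives each step with prob 1 - alpha.

    The reward at step t is collected only if the previous t outage checks all passed.
    """
    return sum((1 - alpha) ** t for t in range(steps))

def alternating_return(horizon):
    """Part (c): without the stay action, the reward is collected on every other step."""
    return sum(1 for t in range(horizon) if t % 2 == 0)

print(discounted_return(0.9, 2000))     # approaches 1 / (1 - 0.9)
print(return_with_outage(0.1, 2000))    # approaches 1 / 0.1 (alpha assumed)
print(alternating_return(10))
```

The outage case has the same series as the discounted case with $\gamma = 1 - \alpha$, which is one way to read a discount factor: as a per-step survival probability.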

Q7. [10 pts] The Value of Games

Pacman is the model of rationality and seeks to maximize his expected utility, but that doesn't mean he never plays games.

(a) [3 pts] Q-Learning to Play under a Conspiracy. Pacman does tabular Q-learning (where every state-action pair has its own Q-value) to figure out how to play a game against the adversarial ghosts. As he likes to explore, Pacman always plays a random action. After enough time has passed, every state-action pair is visited infinitely often. The learning rate decreases as needed. For any game state $s$, the value $\max_a Q(s, a)$ for the learned $Q(s, a)$ is equal to (for complete search trees)

- The minimax value where Pacman maximizes and ghosts minimize.
- The expectimax value where Pacman maximizes and ghosts act uniformly at random.
- The expectimax value where Pacman plays uniformly at random and ghosts minimize.
- The expectimax value where both Pacman and ghosts play uniformly at random.
- None of the above.

Only minimax search correctly models the adversarial game of Pacman's learned policy: although the acting policy is random, the learned policy is the optimal policy for max. Tabular Q-learning and full-depth minimax search both compute the exact value of all states, since Q-learning has a value for every state-action (and thus every state) and the conditions are right for convergence.

(b) [3 pts] Feature-based Q-Learning the Game under a Conspiracy. Pacman now runs feature-based Q-learning. The Q-values are equal to the evaluation function $\sum_{i=1}^{n} w_i f_i(s, a)$ for weights $w$ and features $f$. The number of features is much less than the number of states. As he likes to explore, Pacman always plays a random action. After enough time has passed, every state-action pair is visited infinitely often. The learning rate decreases as needed. The value $\max_a Q(s, a)$ for the learned $Q(s, a)$ is equal to (for complete search trees)

- The minimax value where Pacman maximizes and ghosts minimize and the same evaluation function is used at the leaves.
- The expectimax value where Pacman maximizes and ghosts act uniformly at random and the same evaluation function is used at the leaves.
- The expectimax value where Pacman plays uniformly at random and ghosts minimize and the same evaluation function is used at the leaves.
- The expectimax value where both Pacman and ghosts play uniformly at random and the same evaluation function is used at the leaves.
- None of the above.

Full-depth minimax search computes the approximate value of all the leaves by the evaluation function and then exactly propagates these values up the search tree. Feature-based Q-learning approximates the value of all states with the evaluation function and not only the leaves. Since there are fewer features than states, the approximation is not expressive enough to capture the true values of all the states.

(c) [2 pts] A Costly Game. Pacman is now stuck playing a new game with only costs and no payoff. Instead of maximizing expected utility $V(s)$, he has to minimize expected costs $J(s)$. In place of a reward function, there is a cost function $C(s, a, s')$ for transitions from $s$ to $s'$ by action $a$. We denote the discount factor by $\gamma \in (0, 1)$. $J^*(s)$ is the expected cost incurred by the optimal policy. Which one of the following equations is satisfied by $J^*$?

- $J^*(s) = \min_a \sum_{s'} [C(s, a, s') + \gamma \max_{a'} T(s, a, s') J^*(s')]$
- $J^*(s) = \min_{s'} \sum_a T(s, a, s') [C(s, a, s') + \gamma J^*(s')]$
- $J^*(s) = \min_a \sum_{s'} T(s, a, s') [C(s, a, s') + \gamma \max_{s'} J^*(s')]$
- $J^*(s) = \min_{s'} \sum_a T(s, a, s') [C(s, a, s') + \gamma \max_{s'} J^*(s')]$
- $J^*(s) = \min_a \sum_{s'} T(s, a, s') [C(s, a, s') + \gamma J^*(s')]$
- $J^*(s) = \min_{s'} \sum_a [C(s, a, s') + \gamma J^*(s')]$

Minimum expected cost has the same form as maximum expected utility except that the optimization is in the opposite direction and costs replace rewards.

(d) [2 pts] It's a conspiracy again! The ghosts have rigged the costly game so that once Pacman takes an action they can pick the outcome from all states $s' \in S'(s, a)$, the set of all $s'$ with non-zero probability according to $T(s, a, s')$. Choose the correct Bellman-style equation for Pacman against the adversarial ghosts.

- $J^*(s) = \min_a \max_{s'} T(s, a, s') [C(s, a, s') + \gamma J^*(s')]$
- $J^*(s) = \min_{s'} \sum_a T(s, a, s') [\max_{s'} C(s, a, s') + \gamma J^*(s')]$
- $J^*(s) = \min_a \min_{s'} [C(s, a, s') + \gamma \max_{s'} J^*(s')]$
- $J^*(s) = \min_a \max_{s'} [C(s, a, s') + \gamma J^*(s')]$
- $J^*(s) = \min_{s'} \sum_a T(s, a, s') [\max_{s'} C(s, a, s') + \gamma \max_{s'} J^*(s')]$
- $J^*(s) = \min_a \min_{s'} T(s, a, s') [C(s, a, s') + \gamma J^*(s')]$

Pacman is still minimizing cost, but instead of expected cost it is worst-case (maximum) cost among all possible successors $s'$. The transition probability $T(s, a, s')$ is dropped since the worst-case outcome is selected with certainty.
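The difference between the expected-cost backup of part (c) and the worst-case backup of part (d) shows up even on a one-step problem. The sketch below runs value iteration with both backups on a small hypothetical MDP (states, actions, costs and $\gamma = 0.5$ are all made up for illustration).

```python
GAMMA = 0.5

# Hypothetical MDP: T[s][a] is a list of (probability, next_state, cost).
T = {
    "s0": {
        "safe": [(1.0, "end", 4.0)],
        "risky": [(0.9, "end", 1.0), (0.1, "end", 20.0)],
    },
    "end": {},   # terminal: no actions, zero cost-to-go
}

def backup(J, s, adversarial):
    """One Bellman backup for state s under expected or worst-case outcomes."""
    if not T[s]:
        return 0.0
    best = float("inf")
    for outcomes in T[s].values():
        vals = [(p, c + GAMMA * J[s2]) for p, s2, c in outcomes if p > 0]
        if adversarial:
            q = max(v for _, v in vals)        # ghosts pick the worst reachable s'
        else:
            q = sum(p * v for p, v in vals)    # expectation over s'
        best = min(best, q)                    # Pacman minimizes over actions
    return best

def solve(adversarial, sweeps=50):
    J = {s: 0.0 for s in T}
    for _ in range(sweeps):
        J = {s: backup(J, s, adversarial) for s in T}
    return J

print(solve(adversarial=False)["s0"], solve(adversarial=True)["s0"])
```

In expectation the risky action is cheaper, but once the ghosts choose the outcome the rare expensive transition is taken with certainty and the safe action becomes optimal.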

Q8. [16 pts] Infinite Time to Study

Pacman lives in a calm gridworld. S is the start state and double-squares are exit states. In exits, the only action available is exit, which earns the associated reward and transitions to a terminal state X (not shown). In normal states, the actions are to move to neighboring squares (for example, S has a single action) and they always succeed. There is no living reward, so all non-exit actions have reward 0. Throughout the problem the discount $\gamma = 1$.

[Figures: the calmworld grid and its state names.]

The Q-learning update equation is $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha [R(s, a, s') + \max_{a'} Q(s', a')]$. However, this problem can be solved without manually computing any Q-value updates.

(a) [2 pts] What are the optimal values of S and A?

$V^*(S) = 10$, $V^*(A) = 10$

In a deterministic undiscounted ($\gamma = 1$) MDP the optimal value is the maximum return from the state.

Pacman doesn't know the details of this gridworld so he does Q-learning with a learning rate of 0.5 and all Q-values initialized to 0 to figure it out. Consider the following sequence of transitions in the calmworld (the arrow glyphs for the move actions did not survive in this transcription):

| s | a | s' | r |
| S | | A | 0 |
| A | | E1 | 0 |
| E1 | exit | X | 1 |
| S | | A | 0 |
| A | | E10 | 0 |
| E10 | exit | X | 10 |

(b) [2 pts] Circle the Q-values that are non-zero after these episodes.

Q(S, ) Q(A, ) Q(A, ) Q(E1, exit) Q(E10, exit)

Q-values are only updated when a transition is experienced. Q(E1, exit) and Q(E10, exit) are updated toward the reward earned and become non-zero, but the other state-actions were updated while the Q-values of their successors were still zero.

(c) [2 pts] What do the Q-values converge to if these episodes are repeated infinitely with a constant learning rate of 0.5? Write none if they do not converge.

The MDP is undiscounted and deterministic, so Q-learning converges even though the learning rate is constant. With infinite visits the Q-values will converge to the true values.

Q(S, ) = 10. Q(S, ) is the only state-action for S, so it converges to the optimal value $V^*(S)$.
Q(A, ) = 0. The episode A, is never experienced so it is unchanged after initialization.
Q(A, ) = 1. The only return possible after A, is 1.

(Q-learning details reminder: assume $\alpha = 0.5$ and the Q-values are initialized to 0.)

It's vortex season in the gridworld. In the vortex state the only action is escape, which delivers Pacman to a neighboring state uniformly at random.

[Figure: the vortexworld grid.]

(d) [2 pts] What are the optimal values of S and A in the vortex gridworld?

The optimal value is the mean of the end returns 1 and 10 because the exit states have equal probability. The value of S is the same as A since the discount $\gamma = 1$ and the transition from S to A is deterministic. The transition (A, escape, S) has no impact on the value because the MDP is undiscounted and infinite horizon.

$V^*(S) = 5.5$, $V^*(A) = 5.5$

Consider the following sequences of transitions in the vortexworld:

S1:

| s | a | s' | r |
| S | | A | 0 |
| A | escape | E1 | 0 |
| E1 | exit | X | 1 |
| S | | A | 0 |
| A | escape | E10 | 0 |
| E10 | exit | X | 10 |

S2:

| s | a | s' | r |
| S | | A | 0 |
| A | escape | E1 | 0 |
| E1 | exit | X | 1 |
| S | | A | 0 |
| A | escape | E10 | 0 |
| E10 | exit | X | 10 |
| S | | A | 0 |
| A | escape | E10 | 0 |
| E10 | exit | X | 10 |

(e) [2 pts] What do the Q-values converge to if the sequence S1 is repeated infinitely with appropriately decreasing learning rate? Write never if they do not converge.

Q_S1(S, ) = 5.5, Q_S1(A, escape) = 5.5

The conditions for convergence are satisfied and the Q-values converge to the expected return. The expectation of returns is $\frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 10 = 5.5$.

(f) [2 pts] What if the sequence S2 is repeated instead?

Q_S2(S, ) = 7, Q_S2(A, escape) = 7

The expectation of returns is $\frac{1}{3} \cdot 1 + \frac{2}{3} \cdot 10 = 7$ because two out of three exits in the sequence have reward 10.

(g) [2 pts] Which is the true optimum Q*(S, ) in the vortex gridworld? Circle the answer.

Q_S1(S, )   Q_S2(S, )   other

The sequence S1 has the same distribution of returns as the true distribution, even though not all of the possible transitions are experienced.

(h) [2 pts] Q-learning with constant α = 1 and visiting state-actions infinitely often converges

in calmworld   in vortexworld   in neither world

For learning rate α = 1 the Q-learning update sets Q(s, a) to the sample $R(s, a, s') + \max_{a'} Q(s', a')$ with no regard for the previous value of Q(s, a).

In deterministic MDPs (like calmworld), Q-learning converges even with constant learning rate α = 1. In fact, this learning rate is optimal for deterministic MDPs in the sense that it converges fastest.

In stochastic MDPs (like vortexworld), with constant learning rate α = 1, each Q(s, a) is always equal to the most recent sample for the state-action (s, a). The Q-values will cycle among the possible samples and never converge.
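The contrast between a constant α = 1 and a decreasing learning rate in the vortexworld can be illustrated with a short simulation. This is a sketch under stated assumptions, not the exam's own method: the label `to_A` is a placeholder for S's single move, the helper `run` is hypothetical, and the decreasing schedule α = 1/n (with n the visit count of the state-action) is one common choice that the exam does not specify.

```python
from collections import defaultdict

# Action labels: "to_A" is a placeholder for S's single move (assumption).
ACTIONS = {"S": ["to_A"], "A": ["escape"], "E1": ["exit"], "E10": ["exit"], "X": []}

VIA_E1 = [("S", "to_A", "A", 0), ("A", "escape", "E1", 0), ("E1", "exit", "X", 1)]
VIA_E10 = [("S", "to_A", "A", 0), ("A", "escape", "E10", 0), ("E10", "exit", "X", 10)]
S1 = VIA_E1 + VIA_E10            # one exit of each kind
S2 = VIA_E1 + VIA_E10 + VIA_E10  # two of three exits earn 10

def run(sequence, repeats, alpha_for_visit):
    """Replay `sequence` with undiscounted Q-learning.

    `alpha_for_visit(n)` gives the learning rate for the n-th visit to a
    state-action. Returns the Q-table and the history of Q(A, escape).
    """
    q = defaultdict(float)
    visits = defaultdict(int)
    history = []
    for _ in range(repeats):
        for s, a, s_next, r in sequence:
            visits[(s, a)] += 1
            alpha = alpha_for_visit(visits[(s, a)])
            best_next = max((q[(s_next, a2)] for a2 in ACTIONS[s_next]), default=0.0)
            q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + best_next)
            if (s, a) == ("A", "escape"):
                history.append(q[(s, a)])
    return q, history

# Constant alpha = 1: Q(A, escape) is overwritten by the latest sample.
_, cycling = run(S1, repeats=5, alpha_for_visit=lambda n: 1.0)
print(cycling[-4:])

# Decreasing alpha = 1/n: Q(A, escape) is a running average of the samples.
q_s1, _ = run(S1, repeats=5000, alpha_for_visit=lambda n: 1.0 / n)
q_s2, _ = run(S2, repeats=5000, alpha_for_visit=lambda n: 1.0 / n)
print(round(q_s1[("A", "escape")], 2), round(q_s2[("A", "escape")], 2))
```

With α = 1 the recorded values of Q(A, escape) keep alternating between 1 and 10, while the 1/n schedule settles near 5.5 for S1 and near 7 for S2, in line with parts (e), (f), and (h).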


Reasoning with Uncertainty Reasoning with Uncertainty Markov Decision Models Manfred Huber 2015 1 Markov Decision Process Models Markov models represent the behavior of a random process, including its internal state and the externally

More information

MA300.2 Game Theory 2005, LSE

MA300.2 Game Theory 2005, LSE MA300.2 Game Theory 2005, LSE Answers to Problem Set 2 [1] (a) This is standard (we have even done it in class). The one-shot Cournot outputs can be computed to be A/3, while the payoff to each firm can

More information

FDPE Microeconomics 3 Spring 2017 Pauli Murto TA: Tsz-Ning Wong (These solution hints are based on Julia Salmi s solution hints for Spring 2015.

FDPE Microeconomics 3 Spring 2017 Pauli Murto TA: Tsz-Ning Wong (These solution hints are based on Julia Salmi s solution hints for Spring 2015. FDPE Microeconomics 3 Spring 2017 Pauli Murto TA: Tsz-Ning Wong (These solution hints are based on Julia Salmi s solution hints for Spring 2015.) Hints for Problem Set 3 1. Consider the following strategic

More information

Martingale Pricing Theory in Discrete-Time and Discrete-Space Models

Martingale Pricing Theory in Discrete-Time and Discrete-Space Models IEOR E4707: Foundations of Financial Engineering c 206 by Martin Haugh Martingale Pricing Theory in Discrete-Time and Discrete-Space Models These notes develop the theory of martingale pricing in a discrete-time,

More information

IEOR 3106: Introduction to Operations Research: Stochastic Models SOLUTIONS to Final Exam, Sunday, December 16, 2012

IEOR 3106: Introduction to Operations Research: Stochastic Models SOLUTIONS to Final Exam, Sunday, December 16, 2012 IEOR 306: Introduction to Operations Research: Stochastic Models SOLUTIONS to Final Exam, Sunday, December 6, 202 Four problems, each with multiple parts. Maximum score 00 (+3 bonus) = 3. You need to show

More information

Sublinear Time Algorithms Oct 19, Lecture 1

Sublinear Time Algorithms Oct 19, Lecture 1 0368.416701 Sublinear Time Algorithms Oct 19, 2009 Lecturer: Ronitt Rubinfeld Lecture 1 Scribe: Daniel Shahaf 1 Sublinear-time algorithms: motivation Twenty years ago, there was practically no investigation

More information